Introduction
Topic modeling has emerged as a powerful computational technique for
uncovering latent themes and patterns within large textual corpora,
offering valuable insights into the underlying structures of complex
discourse domains. Within the context of the Polish parliament’s
interpellation process, where members engage in diverse debates and
inquiries on matters of national significance, topic modeling presents a
promising approach to discerning the underlying thematic content and
dynamics.
Purpose of the project
By applying topic modeling algorithms to the corpus of interpellation
texts, this study seeks to identify the key issues, policy domains, and
thematic clusters that dominate parliamentary discourse. Through this
analysis, I aim to shed light on the salient topics, evolving
priorities, and ideological fault lines within the Polish political
landscape, thereby contributing to a deeper understanding of legislative
agendas, public policy, and democratic governance in Poland.
Main assumptions (unexamined beliefs about the case)
- Specific topics emerge as dominant themes across the
interpellation corpus, indicating their prevalence and importance within
the legislative agenda.
- The importance of topics may vary across parliamentary
sessions, political factions, or policy domains, reflecting shifts in
public attention, government priorities, or societal concerns over
time.
Data
Description of the data set
Data comes from The
Polish Parliamentary Corpus / Korpus Dyskursu Parlamentarnego
repository. That repository contains both Sejm and Senat
proceedings, interpellations, etc. from 1919 to 2023. The analysis is
limited to three parliamentary terms (1997-2001, 2001-2005 and 2005-2007) and
only to data from Member of Parliament interpellations. Based on the
files stored in that repository, the dataset for the analysis (both the
content and the available metadata) has been created and preprocessed before
the analysis via external scripts (link to the repository
of this project). The KRNNT tool was used to stem the data.
Preprocessed files are available on Google
Drive.
Preparation of data for modeling
There will be three main steps:
- retrieve the date of each interpellation, since it is available in the
metadata
- clean up the author field, since it is needed for proper network
analysis
- clean the stemmed data
Load data
First step: load data from the preprocessed files.
pi_files <- list.files("data", pattern = "^(pi_)")
all_pi <- do.call(rbind,
lapply(X = pi_files, FUN = function(i){
mydb <- dbConnect(SQLite(), file.path("data", i))
this_cont <- dbGetQuery(conn = mydb,
statement = "select metadata.*, ipcontent.content from ipcontent left join metadata on metadata.id = ipcontent.id")
dbDisconnect(mydb)
this_cont$period <- gsub("pi_", replacement = "", x = gsub(pattern = ".sqlite", replacement = "", x = i))
this_cont
}))
# add information about length of each document
all_pi <- all_pi %>%
mutate(len = nchar(CONTENT))
# create dataset with start and end of each period
period_bands <- data.frame(period = c("1997-2001", "2001-2005", "2005-2007"),
start_date = as.Date(c("1997-10-20", "2001-10-19", "2005-10-19")),
end_date = as.Date(c("2001-10-18", "2005-10-18", "2007-09-07")))
Create date of interpellation
Second step: create the date of creation of each interpellation.
# assumption - each document ends with a date in the format: <day> <month as a word> <year>
last_part_len <- 24
vec_to_extract_dates <- all_pi$CONTENT %>% substr(start = nchar(.) - last_part_len, nchar(.))
# some cases (59 out of 28074 [0.2%]) will be overwritten and not treated with the general approach:
# because the last part of document did not include date (e.g. it consists of footnotes of some sort)
# because there were mistakes (e.g. year 3000)
list_exceptions <- fromJSON(txt = "dicts/pi_dates.json")
vec_to_extract_dates[match(x = names(list_exceptions), table = all_pi$DOC)] <- unlist(list_exceptions)
# remove last part of date
vec_to_extract_dates <- gsub(pattern = "(r|roku)\\s*\\.*\\,*$", replacement = "", x = vec_to_extract_dates)
vec_to_extract_dates <- trimws(vec_to_extract_dates)
# extract year
years <- as.numeric(substr(vec_to_extract_dates, regexpr(pattern = "\\d+\\.*$", text = vec_to_extract_dates), nchar(vec_to_extract_dates)))
# check if everything is ok
if (any(is.na(years))) {
idx <- which(is.na(years))
message(length(idx))
vec_to_extract_dates[idx[1:min(10, length(idx))]]
}
if (!all(years %in% c(1997:2007))) {
idx <- which(!years %in% c(1997:2007))
message(length(idx))
vec_to_extract_dates[idx[1:min(10, length(idx))]]
}
# remove year to ease the process
vec_to_extract_dates <- trimws(substr(vec_to_extract_dates, 1, nchar(vec_to_extract_dates) - 4))
# extract month
months_vec <- rep(0, length = length(vec_to_extract_dates))
months_dict <- list("1" = "stycznia",
"2" = c("lutego", "luty"),
"3" = "marca",
"4" = c("kwietnia", "kwietnie", "kwitnia", "kwietna"),
"5" = "maja",
"6" = "czerwca",
"7" = "lipca",
"8" = "sierpnia",
"9" = c("września", "wrzesień"),
"10" = "października",
"11" = c("listopada", "listopad"),
"12" = "grudnia")
for (m in 1:length(months_dict)) {
vals <- regexpr(pattern = sprintf("(%s)", paste0(months_dict[[m]], collapse = "|")), text = vec_to_extract_dates)
idx <- which(vals > 0)
months_vec[idx] <- as.numeric(names(months_dict)[m])
vec_to_extract_dates[idx] <- substr(vec_to_extract_dates[idx], 1, vals[idx] -1)
}
# check if everything is ok
if (!all(months_vec %in% c(1:12))) {
idx <- which(!months_vec %in% c(1:12))
message(length(idx))
vec_to_extract_dates[idx[1:min(10, length(idx))]]
}
vec_to_extract_dates <- trimws(vec_to_extract_dates)
# extract day
days_vec <- as.numeric(substr(vec_to_extract_dates, regexpr(pattern = "\\d+$", text = vec_to_extract_dates), nchar(vec_to_extract_dates)))
# check if everything is ok
if (!all(days_vec %in% c(1:31))) {
idx <- which(!days_vec %in% c(1:31))
message(length(idx))
vec_to_extract_dates[idx[1:min(10, length(idx))]]
}
# add to dataset
all_pi$date <- as.Date(sprintf("%s-%02d-%02d", years, months_vec, days_vec))
Check if everything is OK: each date should fall between the start and
end of its period.
all_pi %>%
left_join(y = period_bands, by = "period") %>%
mutate(days_after_start = as.numeric(date - start_date)) %>%
filter(days_after_start < 0) %>%
kable() %>%
kable_styling()
| ID | AUTHOR | DOC | CONTENT | period | len | date | start_date | end_date | days_after_start |
|---:|:---|:---|:---|:---|---:|:---|:---|:---|---:|
| NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
But sometimes an interpellation written in period
t is delivered to the Sejm in period
t-1, so its date falls after the end of the term it is filed under.
all_pi %>%
left_join(y = period_bands, by = "period") %>%
mutate(days_before_end = as.numeric(end_date - date)) %>%
filter(days_before_end < 0) %>%
summarise(n(), min(days_before_end), median(days_before_end), max(days_before_end)) %>%
kable() %>%
kable_styling()
| n() | min(days_before_end) | median(days_before_end) | max(days_before_end) |
|---:|---:|---:|---:|
| 31 | -172 | -11 | -1 |
Clean up authors
Third step: clean up the data regarding the author of each interpellation.
all_pi %>% group_by(period) %>%
summarise(n_authors = length(unique(AUTHOR))) %>%
kable() %>%
kable_styling()
| period | n_authors |
|:---|---:|
| 1997-2001 | 559 |
| 2001-2005 | 687 |
| 2005-2007 | 686 |
This is strange; these are supposed to be parliamentary interpellations.
The Polish Sejm has 460 members, but the data for each term suggests
there are many more of them.
A couple of things to consider:
- a member of parliament was sometimes described with first and middle
name (‘Adam Bielan’ and ‘Adam Jerzy Bielan’ are the same person)
- several members may cosign one interpellation (e.g. ‘Adam Bielan i
Zbigniew Ziobro’)
- over the course of a term the member list may change for various reasons
(death, resignation, etc.)
Load the member list. This is the list of members for each term, as of the
end of the term.
# for further analysis
posl_json <- fromJSON(txt = "dicts/poslowie.json")
posl_df <- do.call(rbind,
lapply(names(posl_json),
FUN = function(i){
posl_json_period <- posl_json[[i]]
posl_df <- do.call(rbind,
lapply(1:length(posl_json_period),
FUN = function(j){
nn <- names(posl_json_period)[j]
ps <- unlist(posl_json_period[[nn]])
data.frame(AUTHOR = ps, ugr = names(posl_json_period)[j])
}))
row.names(posl_df) <- NULL
posl_df$period <- i
posl_df
}))
Is everything OK with that data?
posl_df %>%
group_by(period) %>%
summarise(n = n(), uq = length(unique(AUTHOR)), any_duplicates = n > uq, members_dup = paste0(AUTHOR[duplicated(AUTHOR)], collapse = ", ")) %>%
kable() %>%
kable_styling()
| period | n | uq | any_duplicates | members_dup |
|:---|---:|---:|:---|:---|
| 1997-2001 | 460 | 459 | TRUE | Maciej Jankowski |
| 2001-2005 | 460 | 459 | TRUE | Ewa Janik |
| 2005-2007 | 460 | 460 | FALSE | |
There are duplicates in two periods. Let's see how many interpellations
those members authored.
all_pi %>% filter(regexpr(text = AUTHOR, pattern = "(Ewa Janik)|(Maciej Jankowski)") > 0) %>%
group_by(period, AUTHOR) %>% summarise(n()) %>%
kable() %>%
kable_styling()
| period | AUTHOR | n() |
|:---|:---|---:|
| 1997-2001 | Ewa Janik | 29 |
| 1997-2001 | Ewa Janik i Seweryn Kaczmarek | 2 |
| 1997-2001 | Ewa Janik i Władysław Szkop | 2 |
| 1997-2001 | Maciej Jankowski | 2 |
| 2005-2007 | Ewa Janik | 3 |
Apparently only 2 observations (out of 28k) fall in the
periods where a name appears twice on the member list. For simplicity
I'll remove one of the duplicated entries.
posl_df <- posl_df %>% filter(!(AUTHOR == "Maciej Jankowski" & ugr == "Posłowie niezrzeszeni" & period == "1997-2001"))
posl_df <- posl_df %>% filter(!(AUTHOR == "Ewa Janik" & period == "2001-2005"))
posl_df <- rbind(posl_df, data.frame(AUTHOR = "Ewa Janik", period = "2001-2005", ugr = "Klub Parlamentarny Sojuszu Lewicy Demokratycznej"))
Check how many interpellations cannot be matched to the author's
party.
all_pi %>%
left_join(posl_df, by = c("AUTHOR", "period")) %>%
group_by(period) %>%
summarise(n_mem = n_distinct(AUTHOR), count = n(), without_party = sum(is.na(ugr)) / count,
n_mem_without_party = length(unique(AUTHOR[is.na(ugr)])), without_part_count = n_mem_without_party / n_mem) %>%
kable() %>%
kable_styling()
| period | n_mem | count | without_party | n_mem_without_party | without_part_count |
|:---|---:|---:|---:|---:|---:|
| 1997-2001 | 559 | 7444 | 0.1575766 | 276 | 0.4937388 |
| 2001-2005 | 687 | 10906 | 0.1444159 | 322 | 0.4687045 |
| 2005-2007 | 686 | 9724 | 0.2317976 | 317 | 0.4620991 |
- good news: only between 1 in 7 and 1 in 5 interpellations cannot be
matched to an author's party (not great, not terrible) (column
without_party)
- bad news: about half of the authors are not recognized (column
without_part_count)
Let's see which authors are missing from the member list.
all_pi %>%
left_join(posl_df, by = c("AUTHOR", "period")) %>%
group_by(AUTHOR, ugr) %>%
summarise(k = n()) %>%
filter(is.na(ugr)) %>%
head() %>%
kable() %>%
kable_styling()
| AUTHOR | ugr | k |
|:---|:---|---:|
| Adam Bielan | NA | 6 |
| Adam Bielan i Zbigniew Ziobro | NA | 4 |
| Adam Jerzy Bielan | NA | 5 |
| Adam Markiewicz i Andrzej Otręba | NA | 1 |
| Adam Ołdakowski i Maria Zbyrowska | NA | 1 |
| Adam Ołdakowski i Józef Stępkowski | NA | 31 |
As stated before:
- a member was sometimes described with first and middle name (‘Adam
Bielan’ vs ‘Adam Jerzy Bielan’)
- several members may cosign one interpellation (e.g. ‘Adam Bielan i
Zbigniew Ziobro’)
# create dict author + period
un_authors <- all_pi %>% select(AUTHOR, period) %>% unique()
un_authors <- un_authors %>% left_join(y = posl_df, by = c("AUTHOR", "period"))
To clean up that mess I'll introduce:
- one manual change
- the assumption that the first member who signs the interpellation is
its author
- if a member uses a middle name, shorten it to first name and surname
# some manual changes
un_authors$AUTHOR[un_authors$AUTHOR == "łukasz Zbonikowski"] <- "Łukasz Zbonikowski"
# assumption - if multiple people sign an interpellation, assign it to the first on the list (most probably the person responsible for phrasing it)
# most likely it will be "Jan Kowalski i Zbigniew Nowak" (names separated by a lowercase "i")
# if a member uses a middle name - shorten it to first name and surname
un_authors$stripped <- lapply(X = un_authors$AUTHOR,
FUN = function(i){
if (i == "na") return(i)
gg <- strsplit(i, split = "(\\s|Senator|Poseł|Posel|Poslowie|Z należytym szacunkiem)")[[1]]
gg <- trimws(gg)
gg <- gg[nchar(gg) > 0]
ggs <- regexpr(pattern = "^[[:upper:]]", text = gg) > 0
first_zero <- which(!ggs)[1]
if (is.na(first_zero)) first_zero <- length(ggs) + 1
last_one <- max(which(ggs[1:(first_zero - 1)]))
paste0(gg[c(1, last_one)], collapse = " ")
}) %>% unlist()
# join it with the party dictionary
un_authors2 <- un_authors %>% left_join(y = posl_df %>% select(stripped = AUTHOR, period, ugr2 = ugr), by = c("stripped", "period"))
# repeat the manual change, join with the text data and see if it helped
all_pi$AUTHOR[all_pi$AUTHOR == "łukasz Zbonikowski"] <- "Łukasz Zbonikowski"
all_pi %>%
left_join(un_authors2 %>% select(AUTHOR, period, ugr), by = c("AUTHOR", "period")) %>%
left_join(un_authors2 %>% select(AUTHOR, stripped, period, ugr2) %>% unique(), by = c("AUTHOR", "period")) %>%
group_by(period) %>%
summarise(count_original = n_distinct(AUTHOR), count_clean = n_distinct(stripped), count_int = n(),
perc_without_original = sum(is.na(ugr)) / count_int, perc_without_clean = sum(is.na(ugr2)) / count_int) %>%
kable() %>%
kable_styling()
| period | count_original | count_clean | count_int | perc_without_original | perc_without_clean |
|:---|---:|---:|---:|---:|---:|
| 1997-2001 | 559 | 339 | 7444 | 0.1575766 | 0.0569586 |
| 2001-2005 | 687 | 435 | 10906 | 0.1444159 | 0.0790391 |
| 2005-2007 | 685 | 429 | 9752 | 0.2335931 | 0.0796760 |
After the clean-up, instead of 1 in 5 interpellations without a matched
author, only about 1 in 12 are missing.
all_pi <- all_pi %>%
left_join(un_authors2 %>% select(AUTHOR, AUTHOR_CLEAN = stripped, period, party = ugr2) %>% unique(), by = c("AUTHOR", "period"))
Clean stemmed data
The data was stemmed outside of this script; load it here.
# the files have already been tagged
pi_tagged_files <- paste0("tagged_", pi_files)
all_pi_tagged <- do.call(rbind,
lapply(X = pi_tagged_files,
FUN = function(i){
mydb <- dbConnect(SQLite(), file.path("data", i))
this_cont <- dbGetQuery(conn = mydb, statement = "select * from tagged")
dbDisconnect(mydb)
this_cont$period <- gsub("tagged_pi_", replacement = "", x = gsub(pattern = ".sqlite", replacement = "", x = i))
this_cont
}))
head(all_pi_tagged) %>%
kable() %>%
kable_styling()
| ID | ORG | TAGGED | period |
|---:|:---|:---|:---|
| 1 | Na | na | 1997-2001 |
| 1 | początku | początek | 1997-2001 |
| 1 | bieżącego | bieżący | 1997-2001 |
| 1 | roku | rok | 1997-2001 |
| 1 | zwrócił | zwrócić | 1997-2001 |
| 1 | em | być | 1997-2001 |
The data is stripped of whitespace and lowercased (unless it is a
recognizable name). How many tags (stemmed entities) do I have?
n_distinct(all_pi_tagged$TAGGED)
## [1] 64840
Cleaning:
- remove words that correspond to dates
- remove Polish stop words
- remove numbers (in several formats)
- remove dates (in several formats)
- remove words that most likely belong to the greeting or the signature
part of an interpellation
# cleaning
all_pi_tagged_clean <- all_pi_tagged %>%
# remove part with date
anti_join(y = data.frame(TAGGED = c("dzień", "styczeń", "luty", "marzec", "kwiecień", "maj", "czerwiec", "lipiec", "sierpień", "wrzesień",
"październik", "listopad", "grudzień", "rok")), by = "TAGGED") %>%
# remove stopwords
anti_join(y = polish_stop_words_df("TAGGED"))
# remove numbers
all_pi_tagged_clean <- all_pi_tagged_clean[regexpr(pattern = "^\\d+$", text = all_pi_tagged_clean$TAGGED) < 0, ]
# %, +
all_pi_tagged_clean <- all_pi_tagged_clean[regexpr(pattern = "^\\W$", text = all_pi_tagged_clean$TAGGED) < 0, ]
# numbers with comma as decimal point
all_pi_tagged_clean <- all_pi_tagged_clean[regexpr(pattern = "^\\d+,\\d+$", text = all_pi_tagged_clean$TAGGED) < 0, ]
# numbers with dot as decimal point
all_pi_tagged_clean <- all_pi_tagged_clean[regexpr(pattern = "^\\d+\\.\\d+$", text = all_pi_tagged_clean$TAGGED) < 0, ]
# remove dates xx.xx.xxxx
all_pi_tagged_clean <- all_pi_tagged_clean[regexpr(pattern = "^\\d{1,2}\\.(0\\d{1}|\\d{1,2})\\.(\\d{2}|\\d{4})$", text = all_pi_tagged_clean$TAGGED) < 0, ]
# remove dates xx-xx-xxxx
all_pi_tagged_clean <- all_pi_tagged_clean[regexpr(pattern = "^\\d{1,2}-(0\\d{1}|\\d{1,2})-(\\d{2}|\\d{4})$", text = all_pi_tagged_clean$TAGGED) < 0, ]
# remove words like 'szanowny', 'minister', 'premier' because they come from the greeting at the beginning of the interpellation or from the ending ("poseł", "poważanie")
all_pi_tagged_clean <- all_pi_tagged_clean %>% filter(!TAGGED %in% c("szanowny", "minister", "premier", "poseł", "poważanie"))
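The `polish_stop_words_df()` helper used above comes from the project's external scripts and is not shown here. A minimal stand-in could be built on the `stopwords` package; the function name and its column-name argument are assumptions:

```r
# Hypothetical stand-in for the project's polish_stop_words_df() helper:
# returns a one-column data frame of Polish stop words so that it can be
# passed straight into anti_join().
library(stopwords)

polish_stop_words_df <- function(col_name = "TAGGED") {
  df <- data.frame(stopwords::stopwords("pl", source = "stopwords-iso"),
                   stringsAsFactors = FALSE)
  names(df) <- col_name
  df
}
```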
How many tags remain in the end?
un_tagged <- sort(unique(all_pi_tagged_clean$TAGGED))
length(un_tagged)
## [1] 56516
In order to apply text mining algorithms I'll reconstruct documents
from the processed stemmed data.
all_pi_tagged <- all_pi_tagged_clean %>%
group_by(period, ID) %>%
summarise(CONTENT = paste0(TAGGED, collapse = " "))
head(all_pi_tagged, 2) %>%
kable() %>%
kable_styling()
| period | ID | CONTENT |
|:---|---:|:---|
| 1997-2001 | 1 | początek bieżący zwrócić zdrowie sprawa przestrzegać przepis ustawa dotyczyć dopuszczalność przerywać ciąża wskazać niektóry województwo województwo kielecki obowiązywać ustawa kobieta chcieć skorzystać przysługiwać prawo mieć możliwość wynikać fakt ginekolog publiczny placówka opieka zdrowotny gremialnie odmawiać wykonać dopuszczalny prawnie zabieg spełnić zainteresowany kobieta przewidzieć przepis wymóg zaznaczyć przepis dawać określić prawo kobieta mieć obowiązek uprawnienie gwarantować odpowiedź zdrowie zobowiązać podjąć działanie zmierzać województwo kielecki doprowadzić prawidłowy funkcjonować ustawa zgodzić chyba móc województwo mieć placówka kobieta móc liczyć prawo uszanować niezrozumiały chóralny powoływać ginekolog kodeks etyka lekarski często ginekolog wykonywać zabieg aborcja prywatnie zacisze własny gabinet oczywiście odpowiedni opłata oceniać szczyt hipokryzja pomijać kodeks zawodowy móc sytuować obowiązujący prawo podstawowy problem brzmieć kwestia przysługiwać kobieta prawo rozwiązać zgodnie przepis ustawa dotąd nikt przepis zmienić wydać wejście życie nowy konstytucja orzeczenie trybunał konstytucyjny mieć moc stanowić mieć sytuacja obowiązujący prawo przypadek prawo papier chcieć spytać zatem raz działanie resort zdrowie podjąć podjąć przysługiwać kobieta prawo przestrzegać wreszcie sprawa zostać rozwiązać gwarantować kobieta minimum godność prosić konkretny informacja efekt dotychczasowy działanie podjąć województwo kielecki ostatni czas zgłaszać biuro poselski kobieta wskazywać prywatnie pieniądz zabieg móc mieć wykonać jednocześnie lekarz wojewódzki stan wskazać placówka móc mieć wykonać zabieg darmo uważać sytuacja wyjątkowo bulwersujący Władysław Adamski Warszawa |
| 1997-2001 | 2 | trwać protest anestezjolog przeciwny polityka rząd RP dyskryminować lekarz okręg gliwicki głodować anestezjolog szpital główny postulat wyposażyć szpital sprzęt anestezjologiczny często stan użyć zagrażać życie zdrowie pacjent wzrost wynagrodzenie lekarz anestezjolog maksymalny stawka dany grupa zaszeregowanie plus premia kwota niski średni krajowy lekarz wojewódzki A Sośnierz rozmowa głodować lekarz wygospodarować pieniądz rozesłać dyrektor szpital niestety podwyżka rezultat rząd złoty zdrowie lekarz podległy pacjent zagrozić sprawa oczywisty wymagać natychmiastowy realizacja zwracać Jan Olszewski Warszawa |
Basic information
Basic information regarding data
all_pi %>% group_by(period) %>%
summarise(count = n(), autohrs = n_distinct(AUTHOR_CLEAN), av_len = mean(len), sd_len = sd(len), min_len = min(len), max_len = max(len)) %>%
kable() %>%
kable_styling()
| period | count | autohrs | av_len | sd_len | min_len | max_len |
|:---|---:|---:|---:|---:|---:|---:|
| 1997-2001 | 7444 | 339 | 2002.257 | 1371.682 | 260 | 20858 |
| 2001-2005 | 10906 | 435 | 2377.523 | 2084.135 | 65 | 53942 |
| 2005-2007 | 9724 | 429 | 2274.355 | 1594.731 | 82 | 20618 |
For each period there are between 7.4k and 10.9k documents, with an
average length of about 2k characters. Let's see how the number of
interpellations changed by date.
all_pi %>%
group_by(date) %>%
summarise(count = n()) %>%
ggplot(aes(x = date, y = count)) +
geom_col() +
geom_smooth() +
theme_minimal()

One can clearly see that the number of interpellations increases over
time, with some seasonal variation. Let's look at the interpellation
counts by period, but counting days from the start of each term.
all_pi %>%
left_join(y = period_bands, by = "period") %>%
mutate(days_by = as.numeric(date - start_date)) %>%
group_by(days_by, period) %>%
summarise(count = n()) %>%
ggplot(aes(x = days_by, y = count, group = period)) +
geom_point(alpha = .1) +
geom_smooth() +
facet_wrap(~period, scales = "free") +
theme_minimal()

Which words appear the most?
summ_data <- all_pi_tagged_clean %>%
left_join(y = all_pi %>% select(ID, period, date)) %>%
group_by(TAGGED, date) %>%
summarise(count = n(), n_ids = n_distinct(ID)) %>%
ungroup() %>%
arrange(TAGGED, date)
First, look at the count statistic. It measures how
many times the word occurred in a single day.
max_tagged <- summ_data %>% group_by(TAGGED) %>% summarise(count_mx = max(count)) %>% top_n(n = 10, wt = count_mx) %>% .$TAGGED
show_ridge(summ_data = summ_data, tt = max_tagged) +
theme_minimal()
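The `show_ridge()` helper is also defined in the project's external scripts; presumably it draws one ridgeline of daily counts per term, along these lines (a sketch based on `ggridges`; the exact aesthetics are assumptions):

```r
# Hypothetical sketch of show_ridge(): one ridgeline per term, with height
# proportional to the daily count of that term.
library(dplyr)
library(ggplot2)
library(ggridges)

show_ridge <- function(summ_data, tt) {
  summ_data %>%
    filter(TAGGED %in% tt) %>%
    ggplot(aes(x = date, y = TAGGED, height = count, group = TAGGED)) +
    geom_ridgeline(scale = 0.01)  # scale chosen ad hoc for readability
}
```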

There are visible spikes with words like zarząd,
spółka, komisaryczny,
FSO and daewoo at a similar time. This most
probably corresponds to the bankruptcy of Daewoo, which was the main
partner of FSO (link1
and link2).
Another way to look at the data is the number of documents in which each
term occurred.
max_ids <- summ_data %>% group_by(TAGGED) %>% summarise(ids_mx = max(n_ids)) %>% top_n(n = 10, wt = ids_mx) %>% .$TAGGED
show_ridge(summ_data = summ_data, tt = max_ids) +
theme_minimal()

There is little to no information value in those words. All of
them correspond to the act of simply asking a question
(pytanie, sprawa, etc.). Even the word
Warszawa might not relate to the interpellation's subject
but rather come from the interpellation's signature.
For which words does the frequency fluctuate the most?
kk <- summ_data %>% group_by(TAGGED) %>%
summarise(n_days = as.numeric(max(date) - min(date)), mn = sd(count)) %>%
filter(n_days > 0)
show_ridge(summ_data = summ_data, tt = kk %>% filter(n_days > 100) %>% top_n(n = 15, mn) %>% .$TAGGED) +
theme_minimal()

Other than words mentioned earlier
Clusterization
For each period I'll perform these steps:
- create a dtm object
- based on the dtm object, create a tf_mat
object
- based on the tf_mat object, create a
cdist object
- based on the dtm object, create a p_words
object
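The `dtm_from_text()` helper used below comes from the project's external scripts; a plausible sketch using `textmineR::CreateDtm` (the `ngram_window` setting is an assumption, suggested by bigram terms such as skarb_państwo appearing later in the results):

```r
# Hypothetical sketch of dtm_from_text(): build a document-term matrix from
# a data frame column, with document names taken from an id column.
library(textmineR)

dtm_from_text <- function(df, content_col, id_col) {
  CreateDtm(doc_vec = df[[content_col]],
            doc_names = df[[id_col]],
            ngram_window = c(1, 2))  # unigrams and bigrams
}
```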
dtms <- lapply(X = period_bands$period, FUN = function(x) dtm_from_text(all_pi_tagged %>% filter(period == x), "CONTENT", "ID"))
names(dtms) <- period_bands$period
# develop the matrix of term counts to get the IDF vector
tf_mats <- lapply(X = period_bands$period, FUN = function(x) TermDocFreq(dtms[[x]]))
names(tf_mats) <- period_bands$period
# calculate distances
cdists <- lapply(X = period_bands$period, FUN = function(x) calc_cdist(dtms[[x]], tf_mats[[x]]))
names(cdists) <- period_bands$period
# use the probability difference method
p_words <- lapply(X = period_bands$period, FUN = function(x) colSums(dtms[[x]]) / sum(dtms[[x]]))
names(p_words) <- period_bands$period
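The `calc_cdist()` helper is likewise project-specific; it presumably computes cosine distances between documents on TF-IDF-weighted vectors, following the standard textmineR recipe. A sketch, not the project's actual code:

```r
# Hypothetical sketch of calc_cdist(): IDF-weight the term counts,
# L2-normalise each document vector, and turn cosine similarity into a
# distance object suitable for hclust().
calc_cdist <- function(dtm, tf_mat) {
  tfidf <- t(dtm[, tf_mat$term]) * tf_mat$idf   # weight counts by IDF
  tfidf <- t(tfidf)
  csim <- tfidf / sqrt(rowSums(tfidf * tfidf))  # unit-length rows
  csim <- csim %*% t(csim)                      # cosine similarity matrix
  as.dist(1 - csim)                             # similarity -> distance
}
```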
Each of the tabsets below contains the analysis for one term. Some of the
key insights:
- the set of the most frequent words is similar in all terms:
Polska, polski, ustawa, mieć,
móc, sprawa, prawo, praca
- in 1997-2001 clusterization analysis revealed such topics as
healthcare (zdrowie, chory,
zdrowotny, kasa, opieka), state
treasury (spółka, SA, skarb,
skarb_państwo, państwo), education
(szkoła, nauczyciel, dziecko, gmina,
edukacja), social security
(ubezpieczenie, składka, społeczny,
ubezpieczenie_społeczny, ZUS), taxes
(podatek, podatkowy, podatnik,
skarbowy, artykuł), housing
(mieszkaniowy, lokal, mieszkanie,
spółdzielnia, budynek), job market
(praca, urząd, urząd_praca,
bezrobocie, środek) and social
benefits (pomoc, społeczny,
pomoc_społeczny, rodzina, dom)
- the topics in 2001-2005 were partially similar:
treasury (spółka, SA, skarb,
skarb_państwo, państwo), education
(szkoła, dziecko, nauczyciel,
rodzina, uczeń), job market and social
insurance (praca, ubezpieczenie,
społeczny, ZUS, emerytalny),
disability (niepełnosprawny, osoba,
osoba_niepełnosprawny, praca, rehabilitacja),
infrastructure (droga, autostrada,
budowa, kolejowy, PKP),
healthcare (zdrowie, szpital,
zdrowotny, pacjent, chory) and
taxes (podatek, VAT, usługa,
towar, stawka)
- in 2005-2007 clusters corresponded to general topics such as
Poland's development (polski, spółka,
Polska, sprawa, rozwój),
education (szkoła, nauczyciel,
uczeń, egzamin, uczelnia), real
estate (podatek, nieruchomość,
ustawa, podatkowy, mieszkaniowy),
social benefits (praca, osoba,
społeczny, dziecko, świadczenie),
healthcare (zdrowie, szpital,
lekarz, lek, pacjent),
infrastructure (droga, budowa,
autostrada, odcinek, krajowy) as well as very
specific ones such as HCV (HCV, wątroba,
typ, leczenie, zapalenie), donating
blood (krew, honorowy, krwiodawca,
darowizna, podatek) or building
supervision (budowlany, nadzór,
nadzór_budowlany, inspektorat,
powiatowy)
1997-2001
Top words (term frequency)
nn <- "1997-2001"
show_top(tf_mats[[nn]], title = "Top words")

Top words (doc frequency)
show_top(tf_mats[[nn]], wt = "doc_freq", "Top words (doc frequency)")
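The `show_top()` helper is defined in the project's external scripts; judging from its calls, it ranks the `TermDocFreq()` table by a chosen frequency column and plots the top words. A sketch (signature and the top-20 cutoff are assumptions):

```r
# Hypothetical sketch of show_top(): bar chart of the highest-ranked words
# in the TermDocFreq() table, by term frequency or document frequency.
library(dplyr)
library(ggplot2)

show_top <- function(tf_mat, wt = "term_freq", title = "Top words") {
  tf_mat %>%
    slice_max(order_by = .data[[wt]], n = 20) %>%
    ggplot(aes(x = .data[[wt]], y = reorder(term, .data[[wt]]))) +
    geom_col() +
    labs(title = title, x = wt, y = NULL)
}
```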

Ward’s clusterization (9 clusters)
n_clust <- 9
show_hierarch(cdists[[nn]], n_clust = n_clust)
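The `calc_clust()` helper used below is assumed to be a thin wrapper over Ward's hierarchical clustering (consistent with the "Ward's clusterization" heading); a sketch:

```r
# Hypothetical sketch of calc_clust(): Ward's hierarchical clustering on the
# cosine-distance object, cut into n clusters.
calc_clust <- function(cdist, n) {
  hc <- hclust(cdist, method = "ward.D")
  cutree(hc, k = n)
}
```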

Top words in each cluster
clustering <- calc_clust(cdists[[nn]], n = n_clust)
cluster_words <- lapply(X = unique(clustering),
FUN = function(x){
rows <- dtms[[1]][ clustering == x , ]
rows <- rows[ , colSums(rows) > 0 ]
colSums(rows) / sum(rows) - p_words[[1]][ colnames(rows) ]
})
cluster_summary <- data.frame(cluster = unique(clustering),
size = as.numeric(table(clustering)),
top_words = sapply(cluster_words, function(d){
paste(
names(d)[ order(d, decreasing = TRUE) ][ 1:5 ],
collapse = ", ")
}),
stringsAsFactors = FALSE)
cluster_summary %>%
kable() %>%
kable_styling()
| cluster | size | top_words |
|---:|---:|:---|
| 1 | 4802 | polski, Polska, sprawa, europejski, gospodarka |
| 2 | 603 | zdrowie, chory, zdrowotny, kasa, opieka |
| 3 | 393 | spółka, SA, skarb, skarb_państwo, państwo |
| 4 | 645 | szkoła, nauczyciel, dziecko, gmina, edukacja |
| 5 | 267 | ubezpieczenie, składka, społeczny, ubezpieczenie_społeczny, ZUS |
| 6 | 208 | podatek, podatkowy, podatnik, skarbowy, artykuł |
| 7 | 158 | mieszkaniowy, lokal, mieszkanie, spółdzielnia, budynek |
| 8 | 238 | praca, urząd, urząd_praca, bezrobocie, środek |
| 9 | 130 | pomoc, społeczny, pomoc_społeczny, rodzina, dom |
Wordcloud for 2nd cluster
plot_word_cloud(cluster_words, specific_cluster = 2)

2001-2005
Top words (term frequency)
nn <- "2001-2005"
show_top(tf_mats[[nn]], title = "Top words")

Top words (doc frequency)
show_top(tf_mats[[nn]], wt = "doc_freq", "Top words (doc frequency)")

Ward’s clusterization (8 clusters)
n_clust <- 8
show_hierarch(cdists[[nn]], n_clust = n_clust)

Top words in each cluster
clustering <- calc_clust(cdists[[nn]], n = n_clust)
cluster_words <- lapply(X = unique(clustering),
FUN = function(x){
rows <- dtms[[nn]][ clustering == x , ]
rows <- rows[ , colSums(rows) > 0 ]
colSums(rows) / sum(rows) - p_words[[nn]][ colnames(rows) ]
})
cluster_summary <- data.frame(cluster = unique(clustering),
size = as.numeric(table(clustering)),
top_words = sapply(cluster_words, function(d){
paste(
names(d)[ order(d, decreasing = TRUE) ][ 1:5 ],
collapse = ", ")
}),
stringsAsFactors = FALSE)
cluster_summary %>%
kable() %>%
kable_styling()
| cluster | size | top_words |
|---:|---:|:---|
| 1 | 6618 | polski, sprawa, sąd, Polska, prawo |
| 2 | 1081 | spółka, SA, skarb, skarb_państwo, państwo |
| 3 | 907 | szkoła, dziecko, nauczyciel, rodzina, uczeń |
| 4 | 416 | praca, ubezpieczenie, społeczny, ZUS, emerytalny |
| 5 | 130 | niepełnosprawny, osoba, osoba_niepełnosprawny, praca, rehabilitacja |
| 6 | 431 | droga, autostrada, budowa, kolejowy, PKP |
| 7 | 1118 | zdrowie, szpital, zdrowotny, pacjent, chory |
| 8 | 205 | podatek, VAT, usługa, towar, stawka |
Wordcloud for 3rd cluster
plot_word_cloud(cluster_words, specific_cluster = 3)

2005-2007
Top words (term frequency)
nn <- "2005-2007"
show_top(tf_mats[[nn]], title = "Top words")

Top words (doc frequency)
show_top(tf_mats[[nn]], wt = "doc_freq", "Top words (doc frequency)")

Ward’s clusterization (9 clusters)
n_clust <- 9
show_hierarch(cdists[[nn]], n_clust = n_clust)

Top words in each cluster
clustering <- calc_clust(cdists[[nn]], n = n_clust)
cluster_words <- lapply(X = unique(clustering),
FUN = function(x){
rows <- dtms[[nn]][ clustering == x , ]
rows <- rows[ , colSums(rows) > 0 ]
colSums(rows) / sum(rows) - p_words[[nn]][ colnames(rows) ]
})
cluster_summary <- data.frame(cluster = unique(clustering),
size = as.numeric(table(clustering)),
top_words = sapply(cluster_words, function(d){
paste(
names(d)[ order(d, decreasing = TRUE) ][ 1:5 ],
collapse = ", ")
}),
stringsAsFactors = FALSE)
cluster_summary %>%
kable() %>%
kable_styling()
| cluster | size | top_words |
|---:|---:|:---|
| 1 | 5713 | polski, spółka, Polska, sprawa, rozwój |
| 2 | 563 | szkoła, nauczyciel, uczeń, egzamin, uczelnia |
| 3 | 1180 | podatek, nieruchomość, ustawa, podatkowy, mieszkaniowy |
| 4 | 35 | HCV, wątroba, typ, leczenie, zapalenie |
| 5 | 1088 | praca, osoba, społeczny, dziecko, świadczenie |
| 6 | 690 | zdrowie, szpital, lekarz, lek, pacjent |
| 7 | 357 | droga, budowa, autostrada, odcinek, krajowy |
| 8 | 51 | krew, honorowy, krwiodawca, darowizna, podatek |
| 9 | 47 | budowlany, nadzór, nadzór_budowlany, inspektorat, powiatowy |
Wordcloud for 7th cluster
plot_word_cloud(cluster_words, specific_cluster = 7)

Topic modelling
Initial clustering gave an idea of the probable topics in each
term. Another way to tackle this problem is to use topic modelling.
To start, for each term I'll fit a model with 6 topics using the
Gibbs method (although the number of clusters analysed
for each period was higher, this is a time-consuming operation).
lda_outs <- lapply(X = period_bands$period,
FUN = function(x)
LDA(dtms[[x]],
k = 6,
method = "Gibbs",
control = list(seed = 306068)))
names(lda_outs) <- period_bands$period
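The `plot_topics_from_lda_out()` helper is project-specific; a sketch of what it presumably does, using `tidytext` to pull per-topic word probabilities (beta) out of the fitted model and plot the top terms per topic (the number of terms shown is an assumption):

```r
# Hypothetical sketch of plot_topics_from_lda_out(): top terms per topic,
# faceted by topic, ordered by their within-topic probability.
library(dplyr)
library(ggplot2)
library(tidytext)

plot_topics_from_lda_out <- function(lda_out, n_terms = 10) {
  tidy(lda_out, matrix = "beta") %>%
    group_by(topic) %>%
    slice_max(order_by = beta, n = n_terms) %>%
    ungroup() %>%
    mutate(term = reorder_within(term, beta, topic)) %>%
    ggplot(aes(x = beta, y = term)) +
    geom_col() +
    scale_y_reordered() +
    facet_wrap(~topic, scales = "free_y")
}
```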
Results of topic modelling for the 1997-2001 term
plot_topics_from_lda_out(lda_out = lda_outs[[1]]) + labs(title = "1997-2001 period")

Results of topic modelling for the 2001-2005 term
plot_topics_from_lda_out(lda_out = lda_outs[[2]]) + labs(title = "2001-2005 period")

Results of topic modelling for the 2005-2007 term
plot_topics_from_lda_out(lda_out = lda_outs[[3]]) + labs(title = "2005-2007 period")

I'll try to show the results differently.
show_topics_diff_periods(lda_outs, sa = c(.5, 1), st = c(5, 8.1))

In the 1997-2001 term the major topics were:
- state treasury (polski, skarb,
przedsiębiorstwo)
- job market (praca, pracownik)
- law (ustawa, prawo)
- regional state issues (środek, województwo,
gminna)
- healthcare (zdrowie, opieka, chory)
In the 2001-2005 term the major topics were:
- foreign policy (europejski, kraj)
- job market (praca, pracownik)
- economy (spółka, firma, zakład)
- court (sprawa, sąd)
- regional state issues (środek, województwo,
gminna)
- law (ustawa, prawo)
In the 2005-2007 term the major topics were:
- infrastructure (droga, budowa)
- domestic economy (polski, firma,
spółka)
- social issues (praca, społeczny,
dziecko)
- general issues (sprawa, służba)
- law (ustawa, prawo)
- healthcare (zdrowie, szpital)
To quantify how the topics change more gradually I'll split the
dataset into overlapping periods. The graph below shows the split
(segments represent the number of interpellations in each sub-period).
new_periods <- do.call(rbind,
lapply(X = period_bands$period,
FUN = function(x){
tt <- period_bands %>% filter(period == x)
n_days <- 180
ta <- seq(from = tt$start_date, by = n_days,
length.out = ceiling((tt$end_date - tt$start_date) / n_days))
ta[length(ta)] <- tt$end_date
overlap <- 1 # number of periods to overlap
cbind(data.frame(period = x),
data.frame(st = ta[1:(length(ta) - (1 + overlap))], et = ta[(2 + overlap):length(ta)]))
}))
new_periods <- new_periods %>%
mutate(p_label = paste0(period, " (", format(st, "%Y-%m"), " - ", format(et, "%Y-%m"), ")"))
# Count interpellations in each window and draw each window as a horizontal
# segment (x range = window span, y = number of interpellations)
do.call(rbind,
lapply(X = 1:nrow(new_periods),
FUN = function(x){
oo <- new_periods[x, ]
res <- all_pi %>% filter(date <= oo$et & date >= oo$st) %>%
summarise(count = n())
cbind(oo, res)
})) %>%
ggplot(mapping = aes(x = et, xend = st, y = count, yend = count)) +
geom_segment(linewidth = 2) +
scale_y_continuous(limits = function(x) c(0, max(x))) +
theme_minimal()

For each split a document-term matrix is built and a separate LDA model (k = 6) is fitted:
dtms_split <- lapply(X = new_periods$p_label,
FUN = function(x){
oo <- new_periods %>% filter(p_label == x)
all_pi %>%
filter(date <= oo$et & date >= oo$st) %>%
select(period, ID) %>%
left_join(y = all_pi_tagged, by = c("ID", "period")) %>%
dtm_from_text("CONTENT", "ID")
})
names(dtms_split) <- new_periods$p_label
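dtm_from_text() is a project helper defined in the linked repository, so its exact behaviour is not shown here. A minimal base-R stand-in illustrating the idea (a document-term count matrix built from a text column and an id column; the real helper presumably also handles lemmatisation and stop words) might look like:

```r
# Minimal, hypothetical stand-in for the project's dtm_from_text() helper:
# builds a document-term count matrix from a data frame with a text column
# and a document-id column.
dtm_from_text_sketch <- function(df, text_col, id_col) {
  tokens <- lapply(seq_len(nrow(df)), function(i) {
    words <- tolower(unlist(strsplit(df[[text_col]][i], "[^[:alpha:]]+")))
    words[nzchar(words)]  # drop empty strings left by the split
  })
  long <- data.frame(doc  = rep(df[[id_col]], lengths(tokens)),
                     term = unlist(tokens))
  as.matrix(table(long$doc, long$term))  # rows = documents, cols = terms
}

docs <- data.frame(ID = c("i1", "i2"),
                   CONTENT = c("ustawa prawo praca", "praca pracownik praca"))
dtm_from_text_sketch(docs, "CONTENT", "ID")
```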
lda_outs_split <- lapply(X = new_periods$p_label,
FUN = function(x)
LDA(dtms_split[[x]],
k = 6,
method = "Gibbs",
control = list(seed = 306068)))
names(lda_outs_split) <- new_periods$p_label
The chart below presents the results.
show_topics_diff_periods(lda_outs_split, sa = c(.5, 1), st = c(4, 7))

As one can see, some topics are prevalent regardless of the granularity:
healthcare, regional issues, job market and law. Other topics, such as
agriculture or education, are visible only in short periods.
Another way to look at the change of topics is to focus on a single
member. Below are the top 5 members who were present in all three periods
and authored the most interpellations.
all_pi %>%
filter(!is.na(party)) %>%
group_by(AUTHOR_CLEAN) %>%
summarise(dist_period = n_distinct(period), n_int = n()) %>%
filter(dist_period == 3) %>%
top_n(n = 5, wt = n_int) %>% arrange(desc(n_int)) %>%
kable() %>%
kable_styling()
| AUTHOR_CLEAN     | dist_period | n_int |
|------------------|-------------|-------|
| Anna Sobecka     | 3           | 786   |
| Stanisław Stec   | 3           | 555   |
| Janusz Dobrosz   | 3           | 248   |
| Grzegorz Kurczuk | 3           | 240   |
| Jerzy Budnik     | 3           | 195   |
For further analysis I chose Anna Sobecka, since the
number of her interpellations (nearly 800) should be enough to perform
topic modelling.
dtms_split_top_member <- lapply(X = period_bands$period,
FUN = function(x){
all_pi %>%
filter(period == x & AUTHOR_CLEAN == "Anna Sobecka") %>%
select(period, ID) %>%
left_join(y = all_pi_tagged, by = c("ID", "period")) %>%
dtm_from_text("CONTENT", "ID")
})
names(dtms_split_top_member) <- period_bands$period
lda_outs_split_top_member <- lapply(X = period_bands$period,
FUN = function(x)
LDA(dtms_split_top_member[[x]],
k = 6,
method = "Gibbs",
control = list(seed = 306068)))
names(lda_outs_split_top_member) <- period_bands$period
As the chart suggests, the topics covered by this member vary over
time: from railways and the training of new doctors,
through healthcare and law,
up to road construction.
show_topics_diff_periods(lda_outs_split_top_member, sa = c(.5, 1), st = c(5, 8.1))

Evaluation
This text mining project, which applies topic modelling to the
interpellations of Polish parliament members, shows promising potential
for uncovering significant patterns and themes within the political
discourse. However, further refinement of the data preparation and an
extended time horizon could lead to more robust and insightful results,
improving the project’s ability to capture the nuances of parliamentary
exchanges and contributing to a deeper understanding of political
dynamics in Poland.
Summary
- Was the purpose achieved?
The purpose of this project was achieved: I have successfully
applied topic modelling to the interpellations of Polish parliament
members. Additionally, I have presented some new ways of visualising the
findings.
- Were the assumptions examined?
First assumption: Specific topics emerge
as dominant themes across the interpellation corpus, indicating their
prevalence and importance within the legislative agenda.
The analysis supports that claim. Additionally, the topics found in the
data are consistent with common-sense expectations about this corpus.
Second assumption: The importance of topics may vary
across different parliamentary sessions, political factions, or policy
domains, reflecting shifts in public attention, government priorities,
or societal concerns over time.
I believe this assumption is also supported. The topics vary not only
between periods but also within periods, suggesting an ongoing
fluctuation of politicians’ interests.
- Conclusion
Topic modelling based on interpellations from Members of the Polish
Parliament is not only possible but also unveils patterns in the political
corpus. The results could be improved by refining the text-processing
pipeline and/or by adding more periods to the analysis. Tools such as
clustering or LDA are well suited to this kind of analysis.
Appendix with source code and data set
All the code is available in my Github
repository. Preprocessed files are available on Google
Drive. In order to run this file one needs to:
- clone the repository
- copy the data from Google Drive into the
data directory inside
the project directory